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g Abstract 

In this paper will be presented methodology of encoding information in 
*0 valuations of discrete lattice with some translational invariant constrains in 

I asymptotically optimal way. The method is based on finding statistical de- 

scription of such valuations and changing it into statistical algorithm, which 
25 allows to construct deterministically valuation with given statistics. Optimal 

statistics allow to generate valuations with uniform distribution - we get max- 
imum information capacity this way. It will be shown that we can reach the 
optimum for one-dimensional models using maximal entropy random walk and 
O that for the general case we can practically get as close to the capacity of the 

J> model as we want (found numerically: lost 10~^^bit/nodc for Hard Square). 

^ There will be also presented simpler alternative to arithmetic coding method 

;h which can be used as cryptosystem and data correction method too. 



1 Introduction 

Consider all projections 1? — > {0, 1}. In this way we can store 1 bit/node (point 
of the space). Now introduce some constrains, for example: there cannot be two 
neighboring "1" (each node has 4 neighbors) - it's so called Hard Square model(HS). 
It will occur that this restriction reduces the informational capacity to Hhs — 
0.5878911617753406 bits/node. 

The goal of this paper is to introduce methodology of encoding information in 
such models as near their capacity as required. 

We will call a model such triplet - space (Z^), alphabet ({0, 1}) and some con- 
strains. It's elements are all projections fulfilling the constrains - we can think about 
them as valuations of nodes. Now the number of all such valuations over some finite 
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set of nodes(74) will asymptotically grow exponentially = 2*"^^. Because in the 
possibility of choosing one of choices can be stored Ig(A^) bits, this H (Shannon's 
entropy) is the maximal capacity in bits/node we can achieve. 

We can really store Ig(iV) bits in choosing one of N choices, only if all of 
them are equally probable only. So to get the whole available capacity, we have 
to make that all possible valuations are equally probable. Unfortunately the 
space of valuations over infinite space is usually quite complicated. But thanks of 
translational symmetry, elements should have the same local statistical behavior. 
If we find it and valuate the space accordingly, we should get near to the uniform 
distribution over all elements. The statistical algorithm have to encode some 
information in generating some specific valuation, fulfilling the optimal statistics of 
the space. 

Statistical description (p) is a function, which for every finite set {shape) gives the 
probability distribution of valuations on it (patterns). Thanks of the translational 
invariance, we can for example write p{01) - the probability that while taking any 
two neighboring nodes, they give '01' pattern. In one dimension we can find the 
optimal statistical description using pure combinatorics. In higher it's much more 
complicated, but we can for example divide the space into short stripes, create new 
alphabet from their valuations and just use the one-dimensional method. 

Having the statistical description, we can use it to construct the statistical 
algorithm. For example divide the space into straps and valuate them succeedingly. 
Now for succeeding nodes, depending on the valuations of already valuated 
neighbors, we get some probability from created previously table. According to this 
probability we valuate the node, encoding some information. 

Examples of usage: We can think about for example hard disk, locally as 
valuating nodes (let say - magnetizing round dots) of two-dimensional lattice with 
or 1 (without constrains). 




Figure 1: Rescaling the lattice without changing the size of magnetic dot. 

Now let us change the lattice constant, as on fig. 1 - we have -\/2^ = 2 times 
more nodes, but we get some constrains - like in HS - so the capacity is now: 
2 * 0.587 = 1.17 - wc get 17% greater capacity. We've got it by more precise 
positioning of the head - it's technically easier to achieve than shrinking the dot. 
We will see that going further, we can increase the capacity potentially to infinity. 



1 INTRODUCTION 



3 



We can use statistical algorithm approach also to generate random states (e.g. 
spin alignment) in statistical physics. Usually we use Monte-Carlo methods - to 
generate "good" alignment we have to make many runs (the more, the better). 

But using for example many of such alignments, we can approximate its (local) 
statistical description with assumed "goodness". Now using statistical algorithm, 
we can generate so "good" alignments in one run. 

In the second section we will see how to solve analytically the one-dimensional 
model - find its optimal description and capacity. We will motivate that it should 
be Shannon's entropy. To find the optimal description we will just average over all 
elements. In this case the statistical algorithm will be just Markov process - we will 
valuate node by node from left to right and the probability distribution for a node is 
found using only the valuation of the previous one. We get this way random walk on 
a graph (of symbols), which maximizes global entropy ([H]). This approach can be 
generalized to other than uniform distributions of sequences, by introducing some 
vertex/edge potentials. 

In the third section there will be presented asymmetric numeral systems - a 
generalization of numeral systems, which are optimized for encoding sequences of 
equiprobable digits into which the probability distribution of digits is given. It's 
natural way to encode data using given statistical algorithm. This algorithm can 
be alternative for widely used arithmetic coding method: in one table check it 
compress/decompress a few bits (a symbol) and have option that the output is 
encrypted, probably very well. It has also very nice data correction properties. 

In the fourth section there will be introduced formality for general models. 
It will be shown that for "reasonable" models: X = Z", translative invariant con- 
strains with finite range and which are "simple" - valuation of some nodes cannot 
enforce valuation of distant enough ones - we can speak about their entropy, which 
is positive. 

In the fifth section we will introduce methodology of statistical description. 
Now we will define the optimal description straightforward as the average over all 
elements. Unfortunately, in spite of giving many arguments, I couldn't prove exis- 
tence of such average - we will assume it and see that it's equivalent to vanishing 
of long-range correlation. There will be also introduced alternative definition of 
optimality (LOG) - in a finite set of nodes, separated topologically from the rest 
of the space, all valuations are equally probable. Then there will be discussed how 
to generate elements for given statistical description ('almost' statistical algorithm): 
fix some order for nodes and get probability distribution for a node using valuation 
of the previous ones. 

In the sixth section we will analyze some algorithms as above, but this time 
there can be infinite number of previous nodes. We will assume that the probability 
can be determined only by valuation of neighboring nodes. We will analyze two sim- 
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pie types of algorithms: valuate separately subsets inside which constrains doesn't 
work or just take a random order. We will get near optimum this way and explain 
why we can't get it in this way. There will be also described how to iteratively 
generate approximations of uniform distribution of elements on a finite set. The 
longer we will iterate the process, the nearer uniform probability we get. We can 
use this generator to find numerically approximation of the optimal description. 

In the seventh section it will be finally shown how to get as close to the optimal 
algorithm as we want. We will do it by narrowing the space, so that it has only one 
infinite dimension and divide it into " straps" , for which we can use one-dimensional 
analytical methods. There will be shown numerical results, showing that we are 
really tending quickly to the optimum in this way. 

2 One-dimensional model 

In this section we will look at the following transitively invariant model: 

Definition 1. One- dimensional model will be called a triplet: {X,A,M): 

space X = Tj 

alphabet A - finite set of symbols 
constrains M : ^ {0, 1} 

Now elements of this model are V{X), where for A C X: 

V{A) ■.= {v:A^ A\y^^AniA-l)M{v,, v^+i) = 1} (1) 

2.1 Blocking symbols to reduce constrains to range one 

I will shortly justify, that any general one-dimensional, translatively invariant model, 
can be easily reduced to above (with constrains of range one): 

Let I be the largest range of constrains, for example Vk — a ^ Vk+i — b. Take A^ 
as the new alphabet, grouping / consecutive symbols. 

Now we can construct matrix as above: 

^(»^i)i=l..iK).=l..! = 1 ^ 

^ (Vi=2..z Vi — Wi-i and the sequence viV2. --ViWi is consistent with the constrains) 
So we can restrict to the model defined above {I — 1). 

Let's visualize it to analyze some example: 
Definition 2. k-model: 

X — Z A = {0,1} constrain; after 1 follows at least k zeros. 
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Because constrains have range k, we should group k symbols into one of new 
symbols "0","1", ..."k": 

• "0" : there were at least k zeros before (there can be 1 now) 

• "z" : k — i + 1 positions before was the last 1 (in i positions there can be 1) 

In states different than "0", we have to put 0. 

So the whole algorithm (Markov process) is defined by the probability of putting 
1 while in state "0" - denote it g (g := 1 — g). Denote Pi - the probability of state i. 

{Pk =pq 
Pi = Pi+i for i G {1, A; — 1} 
p = Pi+pq 

So pi=p2 = ■■■ =Pk=pq 

p=l-pi-p2- ...-Pk = l-kpq P=iTk^ 

We will explain later, that in a symbol with probability distribution q/q is stored 
at average h{q) := — glg(g) — glg(g) bits. So the entropy of this model is the 
maximum over all algorithms: 

H = max Ha = max — f2) 

9e[o,i] ge[o,i] l + kq 

In the introduction we've seen the example that this method can be used to store 
more data on two-dimensional plane. We've just found analytic expression for the 
one-dimensional case - we have constant length of "magnetic dot" - say d, but we 
don't assume that potential positions cannot intersect (if we assume that - we can 
store Ibit/length d). We assume only that two 1 (magnetized dot) cannot intersect. 

Let say that we can position the dot with precision That means exactly 
that after 1 there have to be k zeros - analyzed model. We can now easily count 
that using k + 1 times more precise positioning, we get: 
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more capacity of information. For larger k we get fig. [2] It goes slowly to infinity 




Figure 2: Informational capacity in bits/'old node' for rescaled k - models 

with A; ^ oo - we could increase the capacity of information potentially as much as 
we need by more precise head positioning. 
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2.2 Maximal entropy random walk 

Let's think how much information can be stored in such sequence/path of length kl 
Denote Nk = #K(A;) ( A; := {0, 1, A; — 1}) the number of such sequences of 
length k. Wc can store some information by choosing one of them. To choose one 
of 2"' elements we need n bits. So in sequence of length k we can store about Ig(A'^fc) 
bits. The average number of bits we can store in one node is: H :— hmfe^oo H'^ 

:= 'li^nSl (3) 

Let's introduce vector c'^ := (c^)ag^, where := #{7 E V [k) : ■jk-i — a}. 
Sequences of length k + 1 are obtained from sequences of length k: 

Theorem 3 (Probenius - Perron). Irreducible matrix with real, nonnegative coeffi- 
cients have dominant real, nonnegative eigenvalue. Corresponding eigenvector has 
real, positive coefficients. 

In our case reducibility would mean that starting from proper subset of alphabet 
we have to stay in it - we could decompose it into irreducible cases. 
Assuming irreducibility, we can say: 

H = IgX (5) 

where A is the dominant eigenvalue of M. 



Look at the normalized corresponding eigenvector: = A0^, X^ae^i^a ~ ^■ 

We could think that it is probability distribution of symbols in elements... 

It's not true: it's the distribution of symbols on the end of a sequence which is 
unbounded in one direction. 

We can use this probability distribution to initiate algorithm we will find. 

Basing on above derivation we will find the probability distribution inside such 
sequences - unbounded in both directions. Now we have to expand in both sides. 

For some (t'i)i=o..m-i, k E N, a,b E A, consider all paths from a to 6 of length 
2k + m, which has v in the middle: 

pfc,a,6 ■- : rn + 2k ^ A, -fklk+i--lk+m-i ^ v, 70 = a, -f2k+m-i = b} 

We will call v a pattern, its domain (m) - its shape and the extremal nodes of 
the domain - its boundary: v :— Vq, v :— Vm-i- 

We want to count allowed paths among them. Define matrix C^: 

(C,%:=#{7er^'«'''ny(^^^+2^)}= Yl ^7071 ^717. •-^7.+..-.7.+..-i (6) 
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The second form uses that such multiphcation is equal 1 for allowed sequences and 
otherwise. This form suggests generalization to other matrices and will be considered 
later. It also suggests k ^ k + 1 iteration: 

For the dominant eigenvalue we have left /right eigenvectors: 

= MiP = XiP (7) 

This time we take normalization (p'^xjj — 1. 

Usually we will consider real symmetric matrices, like for undirected graphs, for 
which (j) = ip. Generally if is not real, wc should use 0^ instead of (f)'^ . 
M has unique dominant eigenvalue, so for large k we can assume: 

^ X'^^jcf)'^ (8) 

This asymptotic behavior is the key point: it allows us to average over infinitely 
long sequences. Presented approach can be generalized to degenerated dominant 
eigenvalue by projecting into its eigenspace, but only if all dominant eigenvalues: 
with the same absolute value has also the same phase. In other case the final value 
would depend on k. 

Substituting we get 

= X'^^P<fClij<f = X'\^P<f)<i>,^P, (9) 

because {C^)a,b = lifa = -u, h = v and otherwise. 

Let's look at the probability distribution of allowed patterns on the given shape 
(m) inside such infinite sequences {k — >• oo). 

Pv = lim 77^V~ = (^°) 

fc-*oo l^^^v{m) Z^a,beA\^w)a,b Z^weV{m) VwV^w 

Notice that we can forget about the normalization assumption in this equation. 



We get 

1. Patterns on the same shape and with equal boundary values are equally prob- 
able. In other words - after fixing values on the boundary of some set, proba- 
bility distribution of valuations inside is uniform. We will see later that it can 
be generalized into higher dimensions. 

2. For m = 1 we get the probability distribution of symbols: 
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3. For m — 2 we get 



4. The probability that from vertex a we will jump to vertex h is 

c Pab Mab Ipb 



^ab 



Pa A Ipa 



(13) 



The Markov process defined this way (P(a — > 6) = Sab) reproduces original statistics 
of the space of infinite allowed sequences with uniform probability distribution. In 
fact it gives uniform probability among finite paths also: for any two points, each 
allowed path of given length(A;) between them has the same probability 

P(path7o7i..7fc) = ^^ (14) 

Taking S'' or multiplying above by the combinatorial factor, we get: 

(c^k^ iM%b A ..... 

{S)ab-^^^^. (15) 

It suggest the statistical algorithm: after symbol a choose the next one with 
{Sab)b probability distribution. 

In this way we get uniform distribution among sequences - get maximal entropy. 
We will also calculate, that it gives maximal possible entropy lg(A). 

Firstly look at the problem: we have a sequence of and 1 in which the probabil- 
ity of 1 is fixed to some p G (0, 1). Let's calculate how much information corresponds 
asymptotically to one digit. Denote p :— 1 — p 



I n+l/2 ra 

^ i2n)-'/^ ^ ^ 



n 

pn) {pn)\{pn)\ (pn)P"+i/^(pn)P"+i/2e"' 

= (27rnpp)-^/2p-f"p-^^" = (27rnpp)~^/22-"(pigp+pigp) 

lg(n) 

Hp — lim ^ = — plgp — plgp =: h{p) (16) 

n— »oo n 

where we've used the Stirling's formula: lim„_,.oo , — \» = 1 

That means that when we choose from all sequences, which number grows like 
2", these with given asymptotical probability of 1 (p), we get 2"'*'^^^ sequences. If 
we restrict to sequences with uncorrelated bits with p = 1/2, wc would get every 
sequence with the same probability - the uniform distribution of sequences. While 
storing information in this distribution we get the maximum capacity: Ibit/digit. 
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a2 a4 ae a« i 



Figure 3: Entropy for fixed p. 

Inserting some correlations, favoring some symbols, or part of the space would 
create redundancy. 

Now when we have larger alphabet A and a probability distribution 
p : X^ae^ Pa — ^, the mean number of information per symbol is (by induction) 



Hp ^ - ^Pa'^gPa 



(17) 



it's the Shannon's entropy - in a symbol with given probability distribution p we 



can store Hp bits. 



In the case of found above Markov process, S — {Sab)a,beA' 

1- ^a,beA Mab > Sab>0 

2- ^aeA '^beA ^"-b ~ ^ 

3. pS^p, Eae^Pa = 1 

Generating path for a Markov process is a sequence of independent choices - the 
entropy of choosing one of sequences is the sum of entropies for single choices. So 
the average amount of information per symbol is: 



H = ~ y^Pa Sab Ig Sab = 
a b 

E Ma \- Mgb jjb , 1 ^, 



b a,o 



1] 



^Sj^ + J2^Mk^a)Mab^l^b - (l^aMabMkM = 

a,b 

■gA+ J^(0„(lgV'a)AV'6 - (l)aXMkA)) = IgA 



a,b 



We know that this is the limit for all allowed sequences, so among stochastic 
processes consistent with the graph {Sab ^ Mab) we cannot get higher entropy - 
encoding information this way gives the maximum capacity. 
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2.3 Generalization to other distributions of sequences 

We've focused on 0/1 matrices, but above derivations are more general. Analo- 
gously we can enforce that paths are not equally probable, but their probability is 
proportional to the sum (integral) of some potential along the way: 



P(path 7o7i...7fc) is proportional to e 



-(5V'(7o)+V'(70,7i)+V(7i)+-+V"(7fc-l,7fc)+5V(7fc)) 



where V, V are some freely chosen real vertex/edge potentials. 

In discretized euclidean path integrals, the potential says about the probability 
that the particle will decay in given place (for example by going to an absorbing 
vertex). In the presented approach, the particle doesn't decay. The potential says 
about the probability that the particle will use given vertex/edge. 



Looking at (|6j) we see that to achieve this potential, we should choose: 

M,, = e-(i^«+^'(^'^-)+i^(^-)) (18) 

So if we want to use only vertex (edge) potential, we can set V = (V = 0). If 
V is symmetric, ip = (p. To forbid some verexes/edges (like before), we can set 
their potential to infinity. We can also choose nonzero diagonal elements to allow 
self- loops. 

Now for a connected weighted graph Frobenius - Perron theorem still works: 
we get the dominant eigenvalue (A) and corresponding normalized right eigenvector 
'0 > 0, ipf = 1 and as before: 

e-(|^(7o)+V'(7o,7i)+V(7i)+-+V'(7fc-i,7fc)+|V(7fc)) ^ 

P(path7o7i,..,7fc) = rz ^ (19) 



A'^ V- 



70 



P. = ^t ^S'U='-^f^- (20) 

2.4 Infinitesimal limit 

To the end of this section we will informally derive infinitesimal limit with some 
time independent vertex potential density. We will do it by covering the space, let 
say M"^, with lattice and decrease its width to 0. 
This time we would like that (informally): 

P(path 7) is proportional to e"-^^^'''^*^^''* 

The problem is that such Brownian-like path has infinite length - this probability 
distribution has to be thought for example as a limit of made for some finite length 
approximations of paths. 
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For simplicity we will take diagonal terms of M equal 0, what corresponds to 
moving with constant speed. But we will see later that adding some constant on 
the diagonal doesn't change the results. 

This time V is some (smooth) function M.'^ —>■ M. Let's choose some time step 
e > and cover the space with lattice of some width 6 > 0. We will treat it as a 
graph, in which each vertex is connected with 2d neighbors. 

We assume that the potential around a node is constant and equal to the orig- 
inal potential V in this node. One edge corresponds to time e, so such discretized 
potential should be chosen as e times the original one. 

This discretized model is symmetric, so (p = ip. 
The eigenvalue relations {Mip = Xip) for one-dimension {d = 1) are: 

for all i, V'i-ie"5(^'-^+^') + ^.^le^i^^'+^'+i) = 

where ipi := 2dVi := V{i6) (to simplify the final equation). 

Now because e — >• 0, we can use e'" = 1 — e: 

(l - |(^.-i + Vi)^ + (l - '-{V, + \/,+i)) = A,7/>, 

- + ^i+i 1 2 A, 

9 + yim^i + {V'i + Vi+i)i>i+i} + -ipi = —ipi 

It looks like there is a problem with ^ipi term - it goes to infinity. But it adds only 
a constant to the operator - it doesn't change its eigenfunctions. The formula for 



the stochastic matrix (15) also won't be affected - this time the powers will become 



exponents and '^^^t(\+c) = The e*'^ term realizes the normalization of probability 
distribution. So we will be able to ignore this constant term of M later. 

We have to connect 5 and e. We see that to obtain 2nd derivative of ip, we 
should take e ~ 5^, what is characteristic for Brownian motion. To simplify the final 
equation, let's choose e = 25^ time scale. 

Let's multiply above equation by -1 and assume some smoothness of ip and V to 
write it for e 0: 

- 2^i + Ipi+i _ 2 - Ae 

h 2Vi^pi = ^pi 



252 

We can sum such equations for all dimensions and take the limit e ^ 0, getting the 
Schrodinger equation: 

= -^^i^ + vij = Eoij 

where A := J2^ dn, H:=-\A + V. 

We've already explained that we can ignore the constant term for M, so we can 
take M = -H. 
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This time £"0 = hm^^o ^^-^ is some new dominant eigenvalue of differential 
operator H, called Hamiltonian. Now the largest A corresponds to the smallest 
E. Corresponding cigcnfunction %lj should be obtained as the limit of succeeding 
approximations - can be chosen as real nonnegative - it's the ground state. 

The equation for probability distribution becomes 

p{x) = ip'^{x) where / il)'^{x)dx = 1 (21) 

which looks similarly to known from the quantum mechanics, but this time ip is real 
and nonnegative function. Sometimes ijj cannot be normalized, but it still can be 
used to calculate the propagator. 

The powers of M matrix becomes the exponent of the differential operator M. 
The stochastic matrix becomes propagator, which gives the probability density of 
finding a particle from x after time t: 

Kix V t) - ^ "^l""*^!^ ^ ^ (22) 
It's easy to check, that as we would expected: 

/ K{x,y,t)dy ^ 1, / K{x,y,t)K{y, z, s)dy ^ K{x, z,t + s), 



I 



p{x)K{x,y,t)dx = p{y). 



As it was already mentioned, e~*^" term is for normalization. The ip division 
term corresponds to time discretization, 1 think. Because ip should be continuous, 
for small times the particle moves corresponding to the local potential only. For 
larger times, this term starts to be important - now the particle doesn't just move 
straightforward between these points, but choose statistically some trajectory - this 
term corresponds to the global structure of potential/topology of the space. 



3 Asymmetric Numeral Systems (ANS) 

We will now show how to use found Markov process (or generally - statistical 
algorithm) to deterministically encode some information. Using the data, we have 
to generate succeeding symbols with given probability distribution {qs)s=o,..,n-i- 
To do it we could use any entropy coder, but in reversed order: encoding into 
Markov's process correspond to decompression, decoding to compression. 

In practice there are used two approaches for entropy coding nowadays: building 
binary tree (Huffman coding) and arithmetic coding. The first one approximates 
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probabilities of symbols with powers of 2 - isn't precise. Arithmetic coding is 
precise. It encodes symbol in choosing one of large ranges of length proportional to 
assumed probability distribution (g). Intuitively, by analogue to standard numeral 
systems - the symbol is encoded on the most important position. To define the 
actual range, we need to use two numbers (states). 

We will construct precise encoding that uses only one state. It will be obtained 
by distributing symbols uniformly instead of in ranges - intuitively: place infor- 
mation on the least important position. Standard numeral systems are optimal 
for encoding streams of equiprobable digits. Asymmetric numeral systems is nat- 
ural generalization into other, freely chosen probability distributions. If we choose 
uniform probability, with proper initialization we get standard numeral system. 

3.1 General concept 

We would like to encode an uncorrelated sequence of symbols of known probability 
distribution into as short as possible sequence of bits. For simplicity we will 
assume that the the probability distribution is constant, but it can be naturally 
generalized for various distributions. The encoder will receive succeeding symbols 
and transform them into succeeding bits. 

An symbol(event) of probability p contains \g{l/p) bits of information - it 
doesn't have to be a natural number. If we just assign to each symbol a sequence 
of bits like in Huffman coding, we approximate probabilities with powers of 
2. If we want to get closer to the optimal compression rates, we have to be 
more precise. We see that to do it, the encoder have to be more complicated 
- use not only the current symbol, but also relate to the previous ones. The 
encoder has to have some state in which is remembered unnatural number of bits of 
information. This state in arithmetic coder are two numbers describing actual range. 

The state of presented encoder will be one natural number: x E N. For this 
subsection we will forget about sending bits to output and focus on encoding sym- 
bols. So the state x in given moment is a natural number which encodes all already 
processed symbols. We could just encode it as a binary number after processing the 
whole sequence, but because of it's size it's completely impractical. In 3.4 it will be 
shown that we can transfer the youngest bits of x to assure that it stays in the fixed 
range during the whole process. For now we are looking for a rule of changing the 
state while processing a symbol s: 

encoding 

(s,x) ^ x' (23) 
decoding 
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So our encoder starts with for example and uses above rule on succeeding symbols. 
These rules are bijective, so that we can uniquely reverse whole process - decode the 
final state back into initial sequence of symbols in reversed order. 

In given moment in x is stored some unnatural number of bits of information. 
While writing it in binary system, we would round this value up. To avoid such 
approximations, we will use convention that x is the possibility of choosing one of 
{0, 1, .., X — 1} numbers, so x contains exactly lg{x) bits of information. 

For assumed probability distribution of n symbols, we will somehow split the 
set {0,1,.. ,x — 1} into n separate subsets - of sizes xo,..,Xn-i £ N, such that 
X]s=o -^s ~ ^- '^^^ treat the possibility of choosing one of x numbers as the 
possibility of choosing the number of subset (s) and then choosing one of numbers. 
So with probability Qs = ^ wc would choose s-th subset. We can enumerate elements 
of s-th subset from to a;^ — 1 in the same order as in the original enumeration of 
{0,l,..,x-l}. 

Summarizing: we've exchanged the possibility of choosing one of x numbers into 
the possibility of choosing a pair: a symbol(s) with known probability distribution 
{qs) and the possibility of choosing one of Xg numbers. This {x ^ {s,Xs)) will be 
the bijective coding we are looking for. 

We will now describe how to split the range. In arithmetic coding approach 
(Range Coding), we would divide {0, ..,x — l} into ranges. In ANS we will distribute 
these subsets uniformly. 

We can describe this split using distribution function Di : N — > {0, .., n — 1}: 

n-l 

{0,..,x-l}=[j{ye{0,..,x-l}: D,(y) = s} 

s=0 

We can now enumerate numbers in these subsets by counting how many are there 
smaller of them in the same subset: 

Xs := #{y e {0, 1, .., X - 1}, L>i(y) = s} D^ix) := (24) 

getting bijective decoding function(D) and it's inverse coding function (C): 

D{x) := {Di{x),D2{x)) = {s,Xs) C{s,Xs) := x. 

Assume that our sequence consist of n G N symbols with given probability 
distribution {qs)s=Q,..,n-i (Vs=o,..,n-i 9s > 0). We have to construct a distribution 
function and coding/decoding function for this distribution: such that 

Vs,a; Xs is approximately x ■ qs. (25) 

We will now show informally how essential above condition is. In 3.3 and 3.5 will 
be shown two ways of making such construction. 
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Statistically in a symbol is encoded H{q) := — Igg^ bits. 
ANS uses lg(x) — Ig(xs) = \g{x/xs) bits of information to encode a symbol s from 
state. Using second Taylor's expansion of logarithm (around Qs), we can estimate 
that our encoder needs at average: 



g,ln(2) 2g2 1n(2) 



1 - 1 



J2 ^"^'I'^'.nf bits/symbol. 



(26) 



g,ln(2) ' ^ 2g,ln(2) 

We could average 

^ E - = 1^ E I i^s/gs - c{s, x.)y 

over all possible Xg to estimate how many bits/symbols we are wasting. 
3.2 Asymmetric Binary System (ABS) 

It occurs that in the binary case, we can find simple explicit formula for cod- 
ing/decoding functions. 

We have now two symbols: "0" and "1". Denote g := gi, so g := 1 — g = go. 
To get Xg ^ X ■ Qs, we can for example take 

Xi := \xq] (or alternatively Xi := [xgj) (27) 

Xq = X — Xi = X — \xq] (or xq = x — [xq\ ) (28) 



Now using (24): -Di(x) = 1 4^ there is a jump of \xq] after it: 

s := \{x + l)g] — \xq] (or s := [(x + l)gj — [xq\) 

We've just defined decoding function: D{x) = {s,Xs). 

For example for g = 0.3: 



(29) 



X 





1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


11 


12 


13 


14 


15 


16 


17 


18 


Xq 







1 




2 


3 




4 


5 


6 




7 


8 




9 


10 




11 


12 


Xi 









1 






2 








3 






4 






5 







We will find coding function now: we have s and Xs and want to find x. 
Denote r := [xg] — xq E [0, 1) 

s := \{x + l)g] — \xq] = \{x + l)g — [xg]] = [(x + l)g — r — xg] = [g — r] 



s = 1 -v^ r < g 
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and < r < g. 



' 3^0 + 1 " 

Q 

Finally coding: 

f [fi^l - 1 if s = / f 

For g = 1/2 it's usual binary system (with switched digits). 
3.3 Stream coding/decoding 

Now we can encode into large natural numbers (x). We would like to use ABS/ANS 
to encode data stream - into potentially infinite sequence of digits (bits) with 
expected uniform distribution. To do it we can sometimes transfer a part of 
information from x into a digit from a standard numeral system to enforce x to 
stay in some fixed range (/). 

Let the data stream be encoded as {0, ..,6 — 1} digits - in standard numeral 
system of base b > 2. In practice we use binary system (b — 2), but thanks of 
this general approach, we can for example use 6 = 2^ to transfer whole byte at 
once. Symbols contain correspondingly lg(l/gs) bits of information. When they 
cumulate into Ig b bits, we will transfer full digit to/from output, moving x back to /. 



• s = 1: Xi = \xq \ — xq + r 



X — 



xi—r 



because it's natural number 



s — 0: q<r<lsoq>l — r>0 
Xq — X — Ixq] — X — xq — r — xq — r 



X — 



xo + r xo + l 




(30) 



Observe that taking interval in form (Z e N): 

I ■.^{l,l + l,..,bl-l} (31) 
for any x eN we have exactly one of three cases: 

• a; G / or 

• X > bl — 1, then 3\ken [x/b''] e / or 

• X <l, then V(^^)g{o,..,fe_i}N 3\keN xb^ + dib^~'^ + .. + dke I. 
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We will call such intervals 6-absorbing: starting from any natural number x, 
after eventual a few reductions {x ^ L^/^J) or placing a few youngest digits in x 
{x —>■ xb + dt) we would finally get into / in unique way. 

For some interval(/), define 

Is^{x: C{s, x) e /}, so that / = |J C{s, Q. (32) 

s 

Define stream decoding: 
{(s,x)=D(x) ; 

use s ; (for example to generate symbol) 

while (x^ I) x=xb+' digit from input' } 

Stream coding(s): 
{while (x^ Is) 

{put mod(x,b) to output; x=Lx/bJ} 
x=C(s,x) } 




Figure 4: Stream coding/decoding 



Wc need that above functions arc ambiguous reverses of each other. 
Observe that we would have it iff for s = 0, .., n — 1 and / are 6-absorbing: 

I^{l,..,lb-1} Is^{h,..,lsb-1} (33) 

for some /, Ig G N. 
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We have: - 1) = *h = #/ = Kb - 1). 

Remembering that C{s,x) ~ x/qs, we finally have: 

Is-ks $^/s = /. (34) 

s 

We will look at the behavior of Ig x while stream coding s now: 

Igx — > ~ Iga; + lg(l/gs) (modulo lg6) (35) 
We have three possible sources of random behavior of x: 

• we choose one of symbol (behavior) in statistical(random) way, 

• usually are irrational, 

• C{s,x) is near but not exactly x/qs- 

It suggests that Igx should cover uniformly the possible space, what agrees with 
statistical simulations. That means that the probability that we are visiting given 
state X should be approximately proportional to 1/x. 

Thanks of this observation we can for example estimate the number of wasted 



bits/symbol (26). Mean value of {x/qs — C{s,x)y depends on used distribution 



function - for precise functions like in ABS, it's smaller than 1. More important 
is the mean value of ^ - it should be about (by integrating) j2 In b. We can 
manipulate / and b parameters to achieve wanted compromise between speed and 
precision of our encoder. 



We will now focus on the stream version of ABS. 

For practical reason we can take: 

1 = 2^ h = T" / = {2^,..,2^+"' - 1} (36) 
We have to check when Jq, Ji are 2'"-absorbing. 

Observe that D2{bl) G {bl — \blq\, \blq\} have to be the smallest number above 
correspondingly Jq or Ji - have to be equal Uq or hli. 
In both cases Jq, /i are 2'"-absorbing iff 

Iblq] = [2^+"'g] is a multiplicity of 2"" (37) 

So if we want to use formulas explicitly, the precision of q should be at most R bits. 

In implementation of data compressor using ABS, we can: 



calculate formulas for every symbol while processing data - it is more precise 
and we can transfer a few bits at once, but it can be a bit slower, and we need 
to be careful with (37), or 
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• store the coding tables in memory - smaller precision, needs memory and time 
for initialization, but should be faster and we have large freedom in choosing 
coding/decoding functions. For example by changing a few last symbols we 
can repair ( [37| ) for more precise q. 

Practical problem is that decoded and encoded sequences of symbols are in re- 
verse order - to use probability prediction methods, we have to make predictions to 
the end, than encode in backward order. Now decompression is straightforward. 
In Matt Mahoney's implementations (fpaqb, fpaqc on [H]) the data is divided into 
compressed separately segments, for which we store q from the prediction process. 

3.4 Asymmetric Numeral Systems(ANS) 

In the general case: encoding a sequence of symbols with probability distribution 
qo,..,qn-i for some n > 2, we could divide the selection of symbol into a few 
binary choices and just use ABS. We will see that we can also encode such symbols 
straightforward. Unfortunately I couldn't find practical explicit formulas for n > 2, 
but we can calculate coding/decoding functions while the initialization, making 
processing of the data stream extremely fast. The problem is that we rather cannot 
table all possible probability distributions - we have to initialize for a few and 
eventually reinitialize sometimes. 

Fix some /, b, (qs) and choose some G N : ~ Iqs, = I- 

We have to choose the distribution function (-Di) for x from / to 6/ — 1 - distribute 
{b — l)/o appearances of symbol '0',(6 — l)/i of '1', ... , {h — l)/n-i of 'n — 1'. 

We could do it using that D2{x) is about previously to choose the most 

appropriate symbol for succeeding x. It would require priority queue for symbols. 

In this section we will focus on a bit less precise but faster statistical method: 
fill a table of size (6 — 1)/ with proper number of appearances of symbols and for 
succeeding x take symbol of random number from this table, reducing the table. 
Another advantage of this approach is that after fixing (Z^), we still have huge 
(exponential in #/) number of possible coding functions - we can choose one using 
some key, additionally encrypting the data. 

Initialization: choose some G N : Is ^ Iqs, Yls = 1'-, 

(fe-l)«0 (6-l)/n-l 

in=(b-l) 1 ; symbols =(0,0,..,0,l,l,..,l,..,n — l,..,n — 1); 
For x=l to b*l-l 

{i=random natural number from 1 to m; 

s=symbols [i] ; symbols [i] =symbols [m] ; m — ; 

D[x] = (s,/s) or C[s,/s]=x 

ls++} 
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Where we can use practically any deterministic pseudorandom number genera- 
tor, like Mersenne Twister ([TO]) and use eventual key for its initialization. 
Practically without any cost we can increase the preciseness of this algorithm as 
much as we want by dividing / into subsegments initialized separately. 

Modern random number generators are practically unpredictable, so the ANS 
initialization would be. It chooses for each a; G / different random local behavior, 
making the state practically unpredictable hidden variable. 

Encryption based on ANS instead of making calculation while taking succeeding 
blocks as standard ciphers, makes all calculations while initialization - processing of 
the data is much faster: just using the tables. 

Another advantage of preinitialized cryptosystem is that it's more resistant to 
brute force attacks - while taking a new key to try we cannot just start decoding 
as usual, but we have to make whole initialization earlier, what can take as much 
time as the user wanted. 

We can use this approach as an error correction method also. For example by 
introducing a new symbol - the forbidden one and rescaling the probability of the 
allowed ones. Now if this symbol occurs, we know that there was an error nearby. 
Knowing the statistical error distribution model, we can create a long list of possible 
errors, sorted by the probability. Now we can try to correct as it would be this case 
for succeeding points of this list and verify by trying to encode the following message. 
In this way we use that the most of blocks are correct, so we can transfer surpluses 
of unused redundancy to help with pessimistic cases. 

We can also use huge freedom of choice for ANS to make the correction easier and 
faster - for example by enforcing that two allowed symbols has Hamming distance at 
least some number. We can for example get Hamming code this way as degenerated 
case - each allowed symbol appears once, so the intermediate state is just 1 - we don't 
connect redundancy of blocks. In other cases we can use the connection between 
their redundancy to cope with badly damaged fragments. 

4 Information capacity of general models 

In this section we will give formalism and methodology for general models and prove 
existence of entropy for for large class of them. 

4.1 Basic definitions 

We will operate on f : A A {Ac X) functions. They will be called as before 
patterns and their domain, denoted - their shape. 

Definition 4. Model will be called a triple {X, A, C) 
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space X C Z'" 

alphabet A -finite set of symbols 

constrains (by forbidden valuations): C G {c : A ^ A : Ac X} 
then elements of model are V{X), where for A C X its valuations are: 

V{A) :={v:A^ A\'^ceC:V(c)cA ^.evic)c{x) ^ v{x)} V := [j V{A) 

A(^X: #A<oo 

Digression: it is a very general definition. Instead of defining by forbidden states, 
we could do it by allowed ones (for example - take large enough subset of X and 
take all allowed valuations). 

We can write for example tiling problems in that formalism (denote each tile 
with a separate symbol and put their shapes in "allowed constrains"). We know 
that there can happen very different situations - even enforcing nonperiodic tiling. 

To control our model we will have to limit this class. 

The main example we will use is the Hard Square model. 
Definition 5. Hard-square model (HS): 

{X = Z\A= {0, 1}, C = {(((2, j), 1), ((^, 0, 1)) : hi. k,leZ,\t-k\ + \j - l\ = 1}) 

In each node of two-dimensional lattice we put or 1 such that there are no 
neighboring '1'. It is one of the simplest models which haven't been solved analyti- 
cally. In 01 we can find exact formula for entropy of similar Hard Hexagon Model 
(we add upper left and lower right to the neighborhood), but this methodology cre- 
ates some unsolvable singularities for HS. In |6j 43 digits of entropy of HS are found 
- I will use it to evaluate algorithms. 

Unfortunately this methodology cannot be used to find statistics. 

Define: 
Definition 6. 

Neighborhood of x E X: N^ := {y E X : 3cec C Ti[c)} 
Range of constrains: L := max{|?/j — Xj| : a; G X, y E N^, i G N} 
Interior of A C X : A' := {x E A : N^ C A} 
Boundary of A C X : A° := A\A- 
Thickening of Ac X: := J^.^^ N^ 
We will call set A C X connected, if 

^x,yeA^neN,zO,...,z"eX = X, z"' = y, Vj |z* — z'^^ = 1. 

j 

Of course A^+ C Ac 
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For translative invariant models, we need to know only the neighborhood around 
one point, e.g. for HS: = x + {(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)}. 

Digression: neighborhood is always symmetric set {x & Ny 4^ y E N^), x E N^, 

so 

p(x,y) = mm3(^o=^,^i,...,^n=j/) Vjx'+^ e N^^ 
is a natural metric for a given model. 

HS model has many symmetries: generated by translations, axial symmetry and 
rotations by n/2. 

Definition 7. A bijection S : X ^ X will be called symmetry if 

ceC ^ CO s eC (38) 

4.2 Existence of entropy 

We will now reduce the family of models we will focus on and prove the existence 
of positive average entropy for them. 

For the rest of this paper we Eissume, that: 

• X - Z"^ 

• has finite range of constrains 

• is translational invariant - every translation is its symmetry :— y + x) 

• is simple (#): 

Definition 8. We call model simple(^), if ^V{X) > 1 and 
there exists natural numbers N > n > 0, such that 

^connected 

where := A, A'+^ := {A')+. 

In other words: for sets "nice" enough (of the form A"), exists thickening {N — n 
times) that for any valuation of its boundary, every valuation of A"- is still allowed. 

For example models with neutral symbol - which isn't in any constrain (we can 
always use it: in HS) are simple(#): n — 0, N — 2 - we fill A'^\A with this symbol. 

To the end of this paper we use notation: 

= N{A, u) := e V{A) : w\viu) = u}, N{A) = N {A, {}) . (39) 
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Lemma 9. Denote fc-block; f3k '■= {0, 1, k — 1}™", now: 

3n'>max(L,l)Vt,ey(/3+\/3^,)iV(/5^',^^) > 1- 

Simply speaking: for every valuation of the neighborhood of large enough block 
we can valuate it in more than one way. 
Proof: Because ^V{X) > 1, we can choose k: #V"(/3'^) > 1. 

can be placed in some n' - block (n' > L). □ 

Theorem 10. For models as above, there exists entropy (H) and is positive: 
There exists increasing sequence of sets (A)-' Ai C X, jj^Ai < oo, \J^Ai = X, such 
that there exists a limit: 

Prove: 

Take n' like in the lemma. We will operate on blocks: B = (3n' placed in nodes of 
lattice n'Z™. 

Because n' > L, valuations of a block can be restricted only by valuations of 
adjoined blocks: B + n'({-l, 0, Ij'^VjO}). 




Figure 5: Block division for Ai, 6*2 , A2, C3, ^3. Black - "essence" . The rest (filler) 
we valuate such that all blocks denoted with the same color has the same pattern. 

First of all we will find the filling pattern - periodic valuation of the space. 

Take the lattice Y := 2n'7/^ and numerate anyhow {0, 1}™" = {x^}i=i,...,2"^- 
Now every block from Y + B + x^ can be valuated independently from the other 
(n' > range of constrains). Choose any Wi G V{B). 

Now using lemma [9| after valuating these blocks we can find some W2 G V{B) for 
Y + B + x'^ not colliding with all wi. And so on we get the universal periodic filling: 

V{X) 3 w : Vj/gy,a;gB,ie{i,..,2™} w{y + x + x') := Wi{x). 

Now we can go to the main construction. 
As the sequence we are looking for, let's take 

A:=B + n'{0,....2'r. - "^^^^ 
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Notice that Hi < - we would get this entropy without constrains. 

To prove that Hi has a hmit, we will construct increasing sequence H'-, that 

Vi H'i<Hi< #A and show that limj^oo H^ - Hi = 0. 

Monotone and bounded sequence has a limit, so Hi will have the same. 

We will make some construction to ensure the monotonicity: 

Ci:=B + n'{x E {0, 2^r : ^i^i ^ {0, 2^}} 

external blocks of Ai which will be filled with w. 
We will need intermediate step: 



C'i^, -.^3 + n'{x e {0, 2^r ■■ ^i^i e {0, 2\ 



(40) 



Of course H^' <H[<H,. 

Now the "essence" of Ai^i\C'ij^^ is made exactly of 2™ "essences" from the previous 
step, with the same valuation of surrounding blocks. 

So N{Ai^,Mc',J = T-N{AiMcd. = [%^) #Ai < 2-#A 

We get: H^ < H^^,, 

H[<H1^,< Hl^, < ... < 4A. 
There left only to show, that limj^oo — H'- — 

N-l n 

Look at A := (A\Ci)~ ~ --+ + ••• + . 

C ^), so from (#): 
we can freely valuate Di, independently to valuation on Cj. 
Because D f3k-2L + {I, I, I), so /32*-2iVL + (NL, NL) c A, 
for large i: 

4Di ~ (2T #(A\A) 

So hm,_oo = 0. 

We have N{Di) < N{Ai,w\c, < N{Ai) 
But j^l^'l < _ equality would be without constrains. 

IgiV(A-)-lgAr(A) ^ #(AA-Di)lg(#^) _^ Q_ 

Prom three sequences: limi_,.oo Hi — H^ — □. 
For large class of models we can speak about their entropy. 



For the rest of this work we add assumption of irreductibiUty: 
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Definition 11. We call translational invariant model irreducible, if: 

U i^y = ^ 

i 

Where A° := A, A^+^ := {A^)+ . 

If a model isn't irreducible (is reducible): Y := |Jj{0}* 7^ ^ 
Because Y = —Y, Y = Y + Y, so yEY^ yL C F - V is a periodic lattice - there 
exists linearly independent , ••, € such that: 

Y= %y^ 

1=1,. ..,m' 

Now if we make a transformation: {Sij)j=i,...,m' we get corresponding m'- 

dimensional irreducible model. 

xr^y^^x — yEY is equivalence relation, so X = IJ- + y (disjoint sum for 
some (xj)) can be treated as identical, independent lattices. 

The average entropy is the same as for the reduced model. 

5 Statistical approach 

Now we will want to find optimal statistical description, by averaging over all valu- 
ations, like in one-dimensional case. 

5.1 Statistical description 

Definition 12. (Statistical) description is a function: 

p:{{A,f):AcX, #A < oo, f : A ^ A} ^ [0,1] 

such that 

VacX, #A<ooVa;eX\^V/:^^^ PfU{ix,a)} = P f (41) 

P{} = 1 

where pj =p{V{f),f). 

It gives for each shape A, the probability distribution of valuations on this 
shape. Because of translational invariance, it will be shown that / doesn't depend 
on its position - for example p{01) denotes the probability that when choosing a 
node, it and its right neighbor are valued correspondingly to 01. 



The conditions (41) ensure normalization to 1. (e.g. p(01)+p(00)=p(0)). 
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Now we would like to take an average over elements, but we can count only 
valuations on finite sets - we have to choose some sequence of finite sets tending to 
the whole space. 

Definition 13. 

We call series of finite sets normal sequence of sets (/li)ieN if 

A0GA1GA2G ... 

For f,veV, AcX : V{v) fl V{f) = 0, V{f) C A, V{y) C A 
we will call its (A, v) approximation of optimal description: 

A,. NiAvUf) 

N{A,v) ^ ' 

Where N{A,u) = 4{w e V{A) : w\v{u) = u}. Denote: pf := 

Now for some B D A, v E V{B°), Df cA~, f E V{A): 

B,, ^ N{B,vUf) ^ ZueviAo\D^)^iB\A-,uUv)N{A,uU f) ^ 
N{B,v) N{B,v) 
V A,u N{B\A-,uUv)N{A,u) _ ^ . ^ . „ 

Where we've divided the sum over all valuations into the sum within and outside 
topologically separating set A°. 

We want to find the optimal description as the limit of succeeding approx- 
imations pf for sets from a normal sequence. We've just shown that we get ap- 
proximation for the succeeding set, as weighted average (^u(zv{a°\d^) Pu'" — ^ 
approximations from the previous one for the same pattern /. 

So going to the next set won't give worse approximation: 

A(iB^pj>p'^>p^>p') 

where 

-A • A.v -A A,v lA "A -A 

Pf :— mm p/ Pf := max p/ df :=Pf —Pr 
veviA") ■> •' veViA") ^ •' 

We've explained that 0?^ isn't growing in normal sequence. If we would prove 

that for some normal sequence (Ai), rf^' is decreasing to 0, than taking any other 

normal sequence because Vj3j Aj C Bj, lirai^oopf' = hnij^^oo ' • 

So this limit would be the only reasonable optimal description. 
Unfortunately I cannot prove formally its existence. I have to assume it: 
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Assumption 14 (*). For any pattern f, d^' — > for some normal sequence (A)- 

We can now define: 

Definition 15. Optimal description, is := linij^ooP/' 
where (Ai) - any normal sequence. 

5.2 Optimal description fulfills general Markov's property 

Standard Markov's property can be thought that knowing only both ending symbols 
of a sequence, we know the probability distribution of its interior. In the second sec- 
tion we'd shown that one-dimensional optimal description fulfills Markov's property. 
Now we will see that in the general case analogous property is fulfilled - knowing 
symbols on the boundary of some set, we know the probability distribution in its 
interior. As previously, it will be uniform distribution among possible valuations. 

Let's show before, that optimal description preserves symmetries 
S - a symmetry of model 

S' : V{X) V{S{X)) = V{X), S'{v){x) = v{S-^{x)) - bijection on V{X) 

Now for a normal sequence (Ai) and some pattern /, take (using freedom of choice) 

normal sequence : = S{Ai) and pattern S'{f): 

" 1^ MS'iw) e V{B,)} -^^'(^) 

Observation 16. For any symmetry of model S and patter f 

P"f = Ps'(f)- 

We will now discuss local optimality condition (LOG). 
While having finite space, we have finite (A^) number of elements - we can store 
there the largest number of information (lg(A)), if they have uniform probabilistic 
distribution - we don't favor any. We have analog to this condition in infinite space: 
when we valuate boundary of some finite set, all available valuations of its interior 
are equally probable - its LOG. 

Observation 17. Optimality of statistical description p is equivalent to LOC(Local 

Optimality Condition): 

for any B C X, #B < oo, A C 5", / G V{B): 



N{B,f\B\A) 



Proof: 
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1. Take p = p°, (Ai) - normal sequence, A, B, f like above 

N{A,f\B\A) = i^{u G V{A) : uUf\B\A e ViB)}NiAi,f) = N{BJ\b\a)N{A, f) 

„o ^ NjAJ) ^ NjAJM ^ pif\B\A) 

NiA,) ^^<^NiBJ\B\A)N{A,) N{BJ\b\a) 

2. assume now LOG ioi p : B C X : #5 < oo, V{f) C A = B- 

P{f)= E Pif^^)= E ^^^^''^f^W^)= E PM^r (44) 



we get the first equality using normalization (41), the second from LOG (we 
can fill A\V{f) in N{B, v U f) ways) 

Taking as B succeeding elements of some normal sequence, we get thesis. □ 

Remarks 



1. For a pattern /, A C X : V{f) C A~ we have(44): 



o o A.v 

Pf= 2^ PvPf 

veviA") 

Now take for example sequence Aq := 0, Aj+i := Af, from (*): p^''^ ^ P°f - 
dependence of probability distribution of distant nodes decrease: 
Assumption (*) is equivalent vanishing of long distance correlations. 

2. LOG is equivalent pLOC - point local optimality condition: 

VB:Ar°cBCX\{0}, #B<oo^veV{B)^a,beA PvU{{0,a)} = PvU{{0,b)} 

where A^^q = Ai'oVjO}, Nq - neighborhood of 0. 

It gives smaller set of conditions for optimality - we have only to check LOG 
for ^A = 1. We will use it later to generate approximations of the uniform 
distribution over elements. It says for example for HS that for any finite 
pattern: 

P 

Where "x" denotes some valuations outside Nq. 
formally: 

VAcX\iVo: #A<ooV„eF(A) Pf*Uv = Pf*Uv 
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where /* := 0\n,\{o} , /: := /* U {(0,a)} 
Other valuations of A^q enforce in the middle. 

Proof: by induction over ^A: for ^A = 1 - pLOC 
assume we've proved LOG for ^A = k ~ 1 
Take some A,B,v:i^A = k, B D A+, v E V{B\A) 
We want to show that for any f,gE V{A) 

PvUf = PvUg 

If there exists x E A such, that f{x) = g{x), than we move x to B 
and use induction assumption. If not, we make intermediate step to /' = 
f\{{x, f{x)} U {(x, g{x)} for some x E A. □ 

3. Take a sequence Aq := {0}, Aj+i := A^ 
than: 

lg(iV(A)) 



lim 



Proof: set i E Z 

Take any valuation of A°_^-^ , we have uniform distribution of available valu- 
ations on Ai. Using {n i N), we can freely valuate Ai_7v+,„, so we have 
more valuations than A^(Aj_Ar+„): 

veV{A,) 



(limj_,oo "^^li^^" = 1) and get the thesis. □ 



Now we repeat discussion from the end of proof of theorem [10 

#A. 



So we've justified that optimal description has the same average capacity 
as the model. 



5.3 Sequential statistical desription 

In statistical description we have some excessive information because of the normal- 
ity conditions. We can get rid of them in e.g. two ways: 

1. We can limit to patterns without fixed symbol (a). E.g. for HS, s : A — > pou 
Proof: We want to get probability of some / with k + 1 appearances of a, e.g.: 
f{x) = a, now: 

Pf = Pf\{ix,a)} - Pf\{i^,a)}U{{x,b)} ■ 

b£A\{a} 
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2. Sequential description : 

Take any numeration of nodes of the space: {x^}i=o,i,...oo — ^ 
and fix some a G A 

^v={vo,vi,...,Vk-i)eA'^ ^beA\{a} QbiV) — " 

Jrjv 

where V{fy) := {a;*}i=o,...,fe-i, fv{x') := Vi. 

We are taking successively and using valuation of previous nodes we get its 
probability distribution of valuations . 



Proof: We want to find some p/, 

3fc {a;o, --^Xk} D V{f), A = {xq, Xk} 

^ueV{A) Pu = Yl(lu(xi){{u{xo), ...,u{xi_i)) qa{w) = 1 - ^ qbiw) 

i=0 \ h&A\{a} J 

Pf^ Yl pf"^^ ■ 

ueV{A\D{f)):uUfeV{A) 

In the next section we will forget about assumption that there are only finite 
number of previous nodes. 

So we are able to generate elements with given statistical description - visit 
successively x^ and generate its valuation with appropriate distribution. 

Digression: assume now that we have some element f : X ^ A generated 
from uniform distribution (e.g. using optimal algorithm). For a given shape A, the 
value of /I A should statistically correspond to optimal statistical description. So 
assuming vanishing of long-range correlation, we should get optimal description just 
by "averaging" f\A+x over x G nX where n is some large number. The same would 
be for any translations of nX, so: 

We could get the optimal description from any random element: by taking "average" 
of f\A+x over all points of the space: x & X. 



6 Statistical algorithms 

Let's say we have a statistical description, the nearer to the optimal, the better. 
Now we want to construct an element using this description. If it would be optimal 
- we get uniform distribution of elements this way - we can store the same amount 
of information as model's capacity. 
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We can use sequential approach like in the previous section, but it would favor 
some points (e.g. the first). We would like to use transitional invariance of the space 

- we cannot assume that there were only finite number of points before. 

We are still assuming that long range correlations vanishes, so as an approxi- 
mation of the optimal algorithm, we can assume that probability distribution for a 
given point depends only on valuations of neighboring ones. 

Definition 18. Statistical algorithm is a pair {<,q): 

• " < " C - linear order X 

• qa : {(x, v):ve V{x<)} ^ [0, 1] {v U {(x, a)} ^ V ^ qa{x, v) = 0) 
such that Z)o6^?a(^>^) = 1; 

where x< :— {y & X : y < x}. 

Usually q will be translational invariant and depend only on neighboring nodes. 
Algorithm have to be consistent with the model - some q are enforced to 0. 

In practice we cannot just "start" algorithm with infinite number of previous 
nodes - we usually need some initialization - it will be discussed on the end of the 
next section. 

While generating an element, we will get some average entropy per choice (node) 

- we will count it on examples. By optimality of algorithm we will think: how distant 
is the capacity given by the algorithm to the real capacity (model's entropy). For 
multidimensional models we usually won't be able to construct optimal algorithm, 
only approximate it. 

We will analyze now two simple algorithms for Hard Square model. 

6.1 Algorithm I: Filling over independent sets 

Divide the space into separate subsets Y^, inside which constrains doesn't work (each 
valuation is allowed). 

For example generally for given range of constrains L, take 

Y = I = {1, L-}, {x%a = {0, L - 1}-, now Y,:^Y + x' 

For HS we can take Y^ = {{x, y) : mod(a; + y,2) = i} - nodes with even/odd sum of 
coordinates. 

Algorithm: fill Yq using some simple probability distribution (e.g.: with prob- 
ability q put 1). Now do the same with Yi ( with probability q'). This time only 
some of nodes can be valued to 1. This second valuation doesn't infiuence any more 
nodes - we can store here as much information as possible - take = 1/2. 
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So our algorithm is described by one number: q. 

Let's calculate the average entropy. Because in one half of nodes we get entropy 
h{q) and in the rest, Ibit/node if it's possible - all neighbors have zeros (probability 
(1-9^)): 

H,= lhiq) + lil-qr 

The best capacity we can achieve this way is max^ Hg = H — 0.0217 
So we lose about 4% of capacity: AH = 0.0217 bit/node. 

Why we couldn't get the optimum? 
We can take q to fulfill relation: 

\ /I 

1 , but then e.g. p\0 0\ >p 
J V 

Explanation: If the middle node is in Yi - we have equahty in both cases{q' — 1/2). 
The strong inequality is got, if the middle is in Yq. Probability that we value the 
middle to 1 is g (probability on the right side). If we value it to 0, we'd have to ad- 
ditionally value its neighbor to 0, which is more probable when we fix 1 in the corner. 

Remind that we have countable number of equalities to fulfil (pLOC), so usually 
finite number of parameters won't be enough - we rather cannot get practical optimal 
algorithm, only its approximation. 





6.2 Element generator using thermalization 

To study the next algorithm, we will need to generate valuations on a finite set (^4) 
with probability distribution close to uniform. We could do it using approximations 
of statistical algorithm. In presented here method, we will do it straightforward 
- by allowing random evolution. This process will approach to kind of thermal 
equilibrium - we can get to the uniform distribution as close as we want. The 
disadvantage comparing to the statistical algorithm, is that this time we have to 
visit every node many times - using this method for coding is highly unpracticc. 
But we can use this generator for example to find the optimal statistics, but it 
occurs that the approaching is slow. 

It is based on pLOC - we will equalize iteratively probabilities, by enforcing 

some random fluctuations: 

For HS after any initialization from V{A) (e.g. using some statistical algorithm) 
Repeat many times: choose randomly a point in A. If all of its 4 
neighbors has value, then change its value (0 <-> 1). 
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Let's explain why we will tend to the uniform distribution. 
We are making some Markov process on V{A). We can go from / G V{A) to 
g e V{A) in one step only if they differ on exactly one node - the probability of this 
transition is ^ for both directions. 

So the stochastic matrix describing this process is symmetric - bistochastic - 
(1,1,...,!) is the dominant eigenvector corresponding to eigenvalue 1 (matrix is 
irreducible) - iterating this process we will tend to the uniform distribution. 

Generally we see that any bistochastic process would be appropriate. 

Practically I've used about 2-5 ^A iteration to get the first valuation and than 
to get next uncorrelated about #A iterations. 

We can now analyze "intuitively optimal" algorithm - with random order. 
6.3 Algorithm II: Random seed 

For every node take with uniform distribution a random real number from [0,1], 



It defines our order: x < y = t{x) < t{y) 

So in the following step of the algorithm we will take random, unvisited node: 

- if it has a neighbor valuated to 1 - wc valuate it to 

- else - we valuate it to or 1 with given probability. 

We could chose this probability as a constant, but we will be more sophisticated. 
When we are in a point with some that means that statistically t of nodes were 
already visited. So we can take some function: 

charging profile - g : [0, 1] ^ [0, 1] - if we are in a node with probability q{t) value 
it to 1, if we can. 

To calculate entropy, we need to find for a given q a second function: 
a : [0, 1] — > [0, 1] - probability that a node with given t can still be valuated to 1. 

Assume that a, q are continuous. 
Now entropy will be: 



t:X ^[0,1] 




(45) 



Notice that for optimality we would need: 



• a(0) = 1 - we have no 1 yet 



• q{l) — ^ - these 1 are not blocking anything 
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• a(l) = 2p° - they are neighbored by 4 zeros 

• 9(0) = P* - this elements will be 1 for sure (a = 1) 

Unfortunately finding a from q seems very difficult, so finding optimal q seems even 
worse. 



We can approximate it numerically, using Monte-Carlo type method: 
On some finite set (say: a square), generate different random orders and valuations 
with uniform distribution (using found generator) and take the averages to find q 
and a. On figjojare shown results {A = {1, ...,300}^, 4000 measures). 




Figure 6: Numerically found q i a. 

This graphs are very blurred. It's because we've simplified: q parameter should 
depends on the order of neighboring nodes. This algorithm looks translative invari- 
ant, but in fact only choosing the order is so. 



After fitting 4th order polynomials and integrating (45), I've got entropy larger 



then model's. It's the lesson that found a doesn't corresponds to q now. 

We can do it exactly by generating valuations using found q: I've got about AH 

0.01 - 0.02bit/node. 



7 Approaching optimum 

In this section will be shown practical method which (if (*) is true) can gives us 
description as near to the optimal one as needed. We can use it to encode information 
with practically real model's capacity. We will show numerical results for HS - there 
is very good tendency to the optimum. 

7.1 Method 

The idea is to approximate the model to be able to use one-dimensional solution: 
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1. Approximation of model - all dimensions but one (we will call it essential) are 
shrunk to a finite width (n), 

2. New alphabet - all allowed valuations of cross-section orthogonal to the essen- 
tial direction of width: range of constrains (L), 

3. Transfer matrix: introduce (transfer) matrix of allowed succeeding (in essential 
direction) new symbols, 

4. Solution to one-dimensional model - as in section 2, 

5. Find algorithm - using found description. It will fill lines succeedingly, 
treating every node in the same way, 

It's good to make one more step: 

6. Algorithm evaluation - use it to reconstruct the real statistical description used 
and calculate entropy. 

We will go through these steps for HS: 

1. Fix (1,0) as the essential direction on Z^. Fix width (n). 

y = Z(l,0) + (l,l)n, 

where n — {0, n — 1}. 

We have to chose some boundary conditions. We can do it e.g. in 2 ways: 

(a) cyclic: \/iv{i + n,n) = v{i,0), 

(b) zero: \/iv{i,—l) = v{i,n) = 0. 

We can think about cyclic conditions as additional constrains, so we get smaller 
entropy than original. 

Zero conditions - the space is split into straps - we have all constrains instead 
those between straps - lines ni and -|- 1 (i e Z) - we reduce number of 
constrains - increase entropy. 

So choosing proper boundary conditions we can bound entropy from below or 

above. 

In this paper we are not interested in calculating entropy, but in statistical 
description - we want alphabet to be as small as possible. So the best will 
be cyclic conditions - thanks of symmetries, we will be able to identify many 
states. 
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2. We are interested in valuations of (1, l)n - we don't have constrains inside - 
new alphabet has 2" symbols: A' := (1, l)n {0, 1} 

but we can identify v,w E A', such that: 

(a) cyclic translation: 3^ Vj v{i) — w{i + k mod n) 

(b) symmetry: Vj v{i) — w{n — 1 — i) 

(c) unimportant zero: v{0) — v{l) — v{2) — 1 A w{l) —OA Vj^iVj = Wi 
We have first two conditions from the cycle symmetry. 

The third says that two 1 blocks neighbor of node between them - its value is 
unimportant. This condition is the advantage of taking diagonal - the number 
of symbols behave like 1.5" and for vertical straps: ip"- = 1.618". 
Using this identifications for example for n = 21 we get 3442 symbols instead 
of 2^1 - 609 times less. 

3. M^^ = l^ueV 

where := {(0, 0), (1, 0)} + (1, l)n, u{i,i) := v{i), u{i + := w{{). 

4. For M find the dominant eigenvector (e.g. using power method) and the 
method from the second section gives the probability distribution of two suc- 
ceeding straps. 

5. Algorithm III: straps of precision (a,b) 

Order: {iJ)<{kJ)^{i-j<k-l)\/{i-j = k-lAi< k) 

g:{0,irx{0,l}^^[0,l] 

q{u, v) = if there are no neighboring 1 and 

u - valuation of previous a nodes, 

V - valuation of b nodes which will infiucncc to following nodes, 
then with this (q) probability valuate this node (?) to 1. 



We get q using two-straps probability distribution - look fig. 7. 




Figure 7: Finding q ior a = b = 2. 

Here are some algorithms as an example, found this way {n — 24): 
(a) a = 6 = 5 = 0.3602994 
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(b) a = b = l 



Q = 



f 0.3365899 0.2946553 \ 
V 0.4406678 0.3999231 J 



(c) a = b = 2 



Q = 



I 0.3350090 0.2918920 0.3265314 

0.3507870 0.3075943 0.3423356 

0.4419816 0.4001011 0.4342912 

\ 0.4421921 0.4004149 0.4345211 



0.2910261 \ 
0.3067152 
0.3992940 
0.3996084 / 



Where the number of line denotes u (00?, 10?, 01?, 11?), column - v analogically. 

6. Algorithm evaluation. 

We need to find statistical the description that is really archived using given 
algorithm. It should be "similar" to used to find the algorithm(we will use it 
as the starting point), but while constructing the algorithm, we've lost some 
information. The description we are looking for, should be fixed point of 
iteration below (figjs]): 



Figure 8: Iteration transforming algorithm into description 

Where the probability distribution for d is given by statistical algorithm. 
There is a problem with c - we cannot find it straightforward. Although we can 
find its distribution, assuming that we've already found good approximation 
of probability distribution of strap valuation. 

Algorithm: 

(a) as starting description take used to find algorithm 

(b) while in following iteration we get description more distant than some 
fixed boundary: 

i. using actual description find probability distribution for (1, l)k 

ii. using algorithm and this distribution make some iterations from the 



While having statistical description we can count capacity we get this way. 
Then we can compare it to the result from - 43 digits of model's entropy. 



c 



d 



figi 
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Here are results: — logiQ(if — Hg) for different a, b: 



a \ b 





1 


2 


3 


4 


5 





2.06 


2.113 


2.1153 


2.1153 


2.1153 


2.1153 


1 


3.32 


3.82 


3.99 


4.001 


4.002 


4.003 


2 


3.50 


4.37 


5.19 


5.44 


5.466 


5.436 


3 


3.50 


4.42 


5.55 


6.43 


6.74 


6.78 


4 


3.50 


4.42 


5.58 


6.71 


7.60 


7.96 


5 


3.50 


4.42 


5.58 


6.74 


7.83 


8.72 



7.2 Initialization 

To use found algorithm in practice, we still need to initiate it. We could valuate 
the first strap anyhow and in a few straps we would tend to assumed statistical 
description - we loose only some information on the boundary. 

But assume we would like to use the whole space optimally. 
On the first look - to valuatc the first strap wc should use the probability distribution 
for one strap. But on one side of this strap there will be not constrains - we should 
be able to store here a bit more of information. 

So for the first strap, we should use the statistical description for straps following 
strap filled with zeros (or suitable boundary conditions): p{v) = Sq^. 
Second: Straps (first (f),sccond(-u)) should have probability distribution: p{v,u) = 
SqvSvu - we can find statistical algorithm for second line from this distribution. 
And so on. Of course we are tending to original algorithm this way. 

8 Conclusions 

• For one-dimensional codings, we can analytically find the optimal statistical 
description - we can encode it with full capacity, 

• We have simpler alternative for arithmetic coding, which can be used to quickly 
compress, encrypt and add redundancy for correction in the same time, 

• We have criteria to ensure that given model has average informational capacity, 

• For models in which long range correlations vanishes, we can speak about opti- 
mal statistical description, which gives us uniform distribution over elements, 

• If we don't need exact optimality, we can use some simple algorithms, like 
filling over independent sets or over random sequence, 

• We can generate approximations of uniform distribution on finite set. 
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• Sometimes we can find algorithm as close to optimal as needed - there are 
results for Hard Square model in this paper. 
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